Text Classification using String Kernels

Authors

  • Huma Lodhi
  • Craig Saunders
  • Leslie Pack Kaelbling
Abstract

We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text, though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how, despite this fact, the inner product can be efficiently evaluated by a dynamic programming technique. Experimental comparisons of the kernel with a standard word feature space kernel (Joachims, 1998) show positive results on modestly sized datasets. The case of contiguous subsequences is also considered for comparison with the subsequences kernel with different decay factors. For larger documents and datasets, the paper introduces an approximation technique that is shown to deliver good approximations efficiently.
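As an illustration of the dynamic programming idea described in the abstract, here is a minimal Python sketch of a gap-weighted subsequence kernel. It is not taken from the paper's text: the function name `ssk`, the parameter names `k` and `lam`, and the exact weighting convention (each common subsequence contributes `lam` raised to the sum of the lengths it spans in both strings) are assumptions chosen to match the abstract's description. The kernel is left unnormalised.

```python
def ssk(s, t, k, lam):
    """Unnormalised gap-weighted subsequence kernel of order k (a sketch).

    Assumed convention: every pair of occurrences of a common length-k
    subsequence contributes lam ** (span in s + span in t), where the span
    runs from the first to the last matched character, inclusive.
    Runs in O(k * len(s) * len(t)) time.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    m, n = len(s), len(t)
    if min(m, n) < k:
        return 0.0

    # kp[a][b] is an auxiliary value for the prefixes s[:a] and t[:b]:
    # it counts matched subsequences of the current length, weighted by the
    # distance from their first character to the END of each prefix.
    # For length 0 this auxiliary value is 1 everywhere.
    kp = [[1.0] * (n + 1) for _ in range(m + 1)]

    for _ in range(1, k):
        new = [[0.0] * (n + 1) for _ in range(m + 1)]
        for a in range(1, m + 1):
            running = 0.0  # matches that end at s[a-1], accumulated over t
            for b in range(1, n + 1):
                running *= lam  # one more character of t to span
                if s[a - 1] == t[b - 1]:
                    running += lam * lam * kp[a - 1][b - 1]
                # either skip s[a-1] (pay lam) or use it as the last match
                new[a][b] = lam * new[a - 1][b] + running
        kp = new

    # Close every subsequence with a final matching pair of characters.
    total = 0.0
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[a - 1][b - 1]
    return total


print(ssk("car", "cat", 2, 0.5))
```

With k = 2, the strings "car" and "cat" share only the subsequence "ca", which is contiguous in both, so the value is lam ** 4. Comparing "car" with itself also picks up "ar" (contiguous) and "cr" (one gap), giving 2 * lam ** 4 + lam ** 6, which shows how the decay factor down-weights the non-contiguous occurrence.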


Similar resources

METER: MEasuring TExt Reuse

In this paper we present results from the METER (MEasuring TExt Reuse) project whose aim is to explore issues pertaining to text reuse and derivation, especially in the context of newspapers using newswire sources. Although the reuse of text by journalists has been studied in linguistics, we are not aware of any investigation using existing computational methods for this particular task. We inv...


A Practical View of Suboptimal Bayesian Classification with Radial Gaussian Kernels

For pattern classification in a multi-dimensional space, the minimum misclassification rate is obtained by using the Bayes criterion. Kernel estimators or probabilistic neural networks provide a good way to evaluate the probability densities of each class of data and are an interesting parallel implementation of the Bayesian classifier. However, their training procedure leads to a very high number of ...


Text and Picture Segmentation by the Distribution Analysis of Wavelet

Statistical classification is an important topic in image processing. Classification helps to interpret images, and it can be incorporated into other image processing algorithms, e.g., image compression [1], to improve performance. A particularly interesting type of classification is the segmentation of pictures and text. By pictures, we mean continuous-tone images such as photographs. By text, we...


Text Classification from Labeled and Unlabeled Documents Using

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from lab...


Hypertext Categorization using Hyperlink Patterns and Meta Data

Hypertext poses new text classification research challenges as hyperlinks, content of linked documents, and meta data about related web sites all provide richer sources of information for hypertext classification that are not available in traditional text classification. We investigate the use of such information for representing web sites, and the effectiveness of different classifiers (Naive Bayes,...




Publication date: 2003